Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

210 ◾ Bioinformatics

across the whole genome. Genes code for proteins that determine traits and biological

functions. RNA-Seq allows investigation of the entire transcriptome through the profiling

of genes so researchers will know which sets of genes are coregulated during the conditions

studied. RNA-Seq is used to study diseases like cancers and patient’s response to therapeu-

tics and other diseases.

The first steps of the workflow of RNA-Seq are similar to the steps of the other NGS

applications as discussed in the previous chapters. The raw data is obtained in FASTQ

files, which can be single end or paired end. The raw data is subjected to quality con-

trol for the trimming of adaptors and removal of duplicate sequences and low-quality

reads. The cleaned raw data is either aligned to a reference genome or a reference tran-

scriptome. The alignment program should be chosen depending on the read length and

splicing awareness. STAR is a good choice for aligning short RNA-Seq reads. However, it

requires sufficient memory and storage space for the indexing of the reference sequence

and read alignment. The alignment information is stored in SAM/BAM files. Programs

like Samtools and PICARD can be used to manipulate and generate alignment statistics

from the BAM files. Quantification of reads aligned to each gene is essential in the RNA-

Seq data analysis because it is the only way to see the gene transcript abundance, which is

an indication for its biological activities. The reads aligned to each gene in BAM files are

counted using reads counting programs like htseq-count and featureCount. In addition to

the BAM files as inputs, these programs require also a reference annotation file in GTF for-

mat. The reads counts produced by the counting programs are stored in a tab-delimited file

containing gene symbols or gene IDs and the corresponding counts of the aligned reads

for each gene. The count file may contain a count column for each sample. Sample counts

in different files can be merged for easy processing. The count data is usually normalized

to be suitable for within-sample gene comparisons and between-sample gene comparisons.

The normalization is made to adjust for the biases arising from the differences in the gene

lengths, GC contents, and library sizes across samples. The normalization methods like

RPKM/FPKM and CPM adjust for the biases arising from the difference in gene lengths;

therefore, they can be used when comparison between the expressions of different genes

within the sample is intended. Other normalization methods like TMM and RLE adjust

for the bias arising from the difference in the library sizes across samples; therefore, they

can be used for between-sample differential expression analysis. The count data generated

by reads counting programs are further analyzed by the differential programs like edgeR,

DESeq2, and cuffdiff. As an example, we used edgeR to demonstrate the steps of the dif-

ferential analysis of the RNA-Seq data.

The differential analysis requires a metadata file that holds the information of the

samples and the study design from which the design matrix is generated. Genes with low

abundance can be filtered out since they do not contribute to the differential analysis. The

normalized count data as response variable and the design matrix as explanatory vari-

ables are used for a GLM fitting. Since the count data is over-dispersed (variance is greater

than the mean), the negative binomial distribution is assumed to fit the RNA-Seq count

data to a log-linear model to estimate the dispersions and other statistics for the differen-

tial gene expression including log-fold change, p-values, and false discovery rates (FDR).